Project Ensemble Techniques

By Ajay Kumar

• DOMAIN: Telecom

• CONTEXT: A telecom company wants to use their historical customer data to predict behaviour to retain customers. You can analyse all relevant customer data and develop focused customer retention programs.

• DATA DESCRIPTION: Each row represents a customer and each column contains a customer attribute, as described in the column metadata. The data set includes information about:

• Customers who left within the last month – the column is called Churn

• Services that each customer has signed up for – phone, multiple lines, internet, online security, online backup, device protection, tech support, and streaming TV and movies

• Customer account information – how long they’ve been a customer, contract, payment method, paperless billing, monthly charges, and total charges

• Demographic info about customers – gender, age range, and if they have partners and dependents

Importing the necessary libraries

1. Import and warehouse data:

• Import all the given datasets from MYSQL server. Explore shape and size.

• Merge all datasets onto one and explore final shape and size.

After merging all the datasets, the final shape is (7043, 21), i.e. the dataset has 7043 entries and 21 attributes.
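A minimal sketch of the import-and-merge step. The connection string and table names below are hypothetical (the report does not list them); the merge itself is demonstrated on small stand-in frames:

```python
import pandas as pd

# Hypothetical MySQL import (credentials and table names are assumptions):
# from sqlalchemy import create_engine
# engine = create_engine("mysql+pymysql://user:password@host/telecom_db")
# customers = pd.read_sql("SELECT * FROM customers", engine)
# services  = pd.read_sql("SELECT * FROM services", engine)

# Illustrative stand-ins for the imported tables
customers = pd.DataFrame({"customerID": ["C1", "C2", "C3"],
                          "tenure": [1, 34, 2]})
services = pd.DataFrame({"customerID": ["C1", "C2", "C3"],
                         "PhoneService": ["Yes", "Yes", "No"]})

# Merge the tables on the common key and inspect the final shape
merged = customers.merge(services, on="customerID", how="inner")
print(merged.shape)  # (3, 3) for this toy example
```

On the real tables the same `merge` call on `customerID` yields the (7043, 21) frame described above.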

2. Data cleansing:

• Missing value treatment

• Convert categorical attributes to continuous using relevant functional knowledge

• Drop attribute/s if required using relevant functional knowledge

• Automate all the above steps

The dataset has some null values.

The features comprise 1 float, 2 integer, and 18 object data types.

Most of the features are categorical in nature.

The dataset has some null values; these are replaced with the mean, median, or mode as appropriate for each attribute.

The dataset has no duplicate values.

The value counts of each categorical variable tell us the number of levels and the class distribution within each feature.

Now the dataset has no null values
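The cleansing steps above can be automated in a single helper. This is a minimal sketch (the function name and the toy frame are illustrative): numeric columns are filled with the median and object columns with the mode.

```python
import numpy as np
import pandas as pd

def clean_missing(df):
    """Fill numeric columns with the median and categorical (object)
    columns with the mode, returning a frame with no nulls."""
    df = df.copy()
    for col in df.columns:
        if pd.api.types.is_numeric_dtype(df[col]):
            df[col] = df[col].fillna(df[col].median())
        else:
            df[col] = df[col].fillna(df[col].mode()[0])
    return df

# Small illustrative frame with nulls in both column types
raw = pd.DataFrame({"TotalCharges": [29.85, np.nan, 108.15],
                    "Contract": ["Month-to-month", None, "Two year"]})
clean = clean_missing(raw)
print(clean.isnull().sum().sum())  # 0
```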

In the categorical features, the levels "No internet service" and "No phone service" carry the same meaning as "No", so they are replaced with "No".
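This collapsing of redundant levels can be done with a single `replace` call (the toy frame below is illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "OnlineSecurity": ["Yes", "No internet service", "No"],
    "MultipleLines": ["No phone service", "Yes", "No"],
})

# "No internet service" / "No phone service" convey the same information
# as a plain "No", so collapse them into a single level
df = df.replace({"No internet service": "No", "No phone service": "No"})
print(df["OnlineSecurity"].unique())  # ['Yes' 'No']
```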

Here, I have label encoded the categorical variables for further analysis.
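A sketch of the label-encoding step with scikit-learn's `LabelEncoder` (the toy frame is illustrative; the real code loops over all 18 object columns the same way):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({"Churn": ["Yes", "No", "No", "Yes"],
                   "Contract": ["Month-to-month", "Two year", "One year",
                                "Month-to-month"]})

# Encode each object column to integer codes; keep the fitted encoders
# around so the codes can be inverted later if needed
encoders = {}
for col in df.select_dtypes(include="object").columns:
    encoders[col] = LabelEncoder()
    df[col] = encoders[col].fit_transform(df[col])
print(df["Churn"].tolist())  # [1, 0, 0, 1]  ('No'->0, 'Yes'->1)
```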

The above boxplot shows there are no outliers in the attributes.

Feature selection using ExtraTreesClassifier
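A minimal sketch of feature selection with `ExtraTreesClassifier`, on synthetic data (the feature names and the target rule are assumptions for illustration): the fitted model's `feature_importances_` rank the predictors.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import ExtraTreesClassifier

rng = np.random.default_rng(0)
X = pd.DataFrame({"tenure": rng.integers(0, 72, 200),
                  "MonthlyCharges": rng.uniform(20, 120, 200)})
# Synthetic target tied to tenure, so that one feature dominates
y = (X["tenure"] < 20).astype(int)

model = ExtraTreesClassifier(n_estimators=100, random_state=42)
model.fit(X, y)
importances = pd.Series(model.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False))
```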

3. Data analysis & visualisation:

• Perform detailed statistical analysis on the data.

• Perform a detailed univariate, bivariate and multivariate analysis with appropriate detailed comments after each analysis.

The statistical summary shows that the means and standard deviations of the attributes are not on the same scale.

The minimum tenure is 0 months and the maximum is 72 months.

The dataset therefore needs to be normalized.

Univariate Analysis

Tenure: the data looks roughly uniformly distributed, with two peaks at the lower and upper ends. It has no outliers.

MonthlyCharges: slightly right-skewed, with no outliers.

TotalCharges: highly right-skewed, with no outliers.

Most features have 2 or 3 levels.

The classes of most features are not balanced.

Bivariate Analysis

The above boxplot shows that when monthly charges are high, the chance of churning is also high.

Customers are more likely to churn when TotalCharges is below 4000.

There is a strong positive linear correlation between these two variables, and the dataset also shows high variance.

This scatterplot indicates high variance in the dataset with respect to the target class; the points are distributed like a cloud.

This feature is not a good predictor for the target.

The above plot shows that, for every class of InternetService, the probability of churning increases as the charges increase.

Customers with MultipleLines show a higher probability of churning; this feature is useful for prediction.

The above figure shows that customers with InternetService also have a higher probability of churning.

Contract is also a good contributor for predicting churn, especially for customers on a month-to-month (zero) contract.

PaperlessBilling is also a good predictor of churn.

Multivariate Analysis

The correlation heatmap indicates that the features are both positively and negatively correlated with one another.

TotalCharges and MonthlyCharges are highly positively correlated with each other.

There is a negative correlation between Churn and Contract.

The pairplot does not give much additional information, except the positive correlation between MonthlyCharges and TotalCharges.

Creating dummies for categorical variables
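Dummy creation can be sketched with `pd.get_dummies` (the toy frame is illustrative; `drop_first=True` avoids the dummy-variable trap by dropping one level per feature):

```python
import pandas as pd

df = pd.DataFrame({"Contract": ["Month-to-month", "One year", "Two year"],
                   "MonthlyCharges": [29.85, 56.95, 108.15]})

# One-hot encode the categorical column; the first level is dropped
dummies = pd.get_dummies(df, columns=["Contract"], drop_first=True)
print(list(dummies.columns))
# ['MonthlyCharges', 'Contract_One year', 'Contract_Two year']
```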

4. Data pre-processing

• Segregate predictors vs target attributes

• Check for target balancing and fix it if found imbalanced.

• Perform train-test split.

• Check if the train and test data have similar statistical characteristics when compared with original data.

The target variable is imbalanced: 73% of observations belong to class 1 and 27% to class 0.

Using the SMOTE class from imblearn to generate synthetic data for the minority class.

After applying SMOTE, both target classes are balanced, with 3621 observations each, as seen above; SMOTE has generated synthetic data for the minority class.

5. Model training, testing and tuning:

• Train and test all ensemble models taught in the learning module.

• Suggestion: Use standard ensembles available. Also you can design your own ensemble technique using weak classifiers.

• Display the classification accuracies for train and test data.

• Apply all the possible tuning techniques to train the best model for the given data.

• Suggestion: Use all possible hyper parameter combinations to extract the best accuracies.

• Display and compare all the models designed with their train and test accuracies.

• Select the final best trained model along with your detailed comments for selecting this model.

• Pickle the selected model for future use

Decision Tree Model

The model has overfitted because the tree was fully grown without any hyperparameter tuning.

Now I will regularize the tree by tuning its hyperparameters to overcome the overfitting.

After pruning, the tree achieves accuracies of 74% and 64% on the training and testing data respectively.
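The pruning step can be sketched as follows, on synthetic data (the dataset and the specific `max_depth`/`min_samples_leaf` values are assumptions; the report does not list the parameters used): the unconstrained tree memorises the training set, while the regularised tree trades training accuracy for generalisation.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=10, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42)

# Fully grown tree: memorises the training data
full = DecisionTreeClassifier(random_state=42).fit(X_tr, y_tr)

# Pruned tree: depth and leaf-size limits regularise the model
pruned = DecisionTreeClassifier(max_depth=4, min_samples_leaf=20,
                                random_state=42).fit(X_tr, y_tr)

print("full   train/test:", full.score(X_tr, y_tr), full.score(X_te, y_te))
print("pruned train/test:", pruned.score(X_tr, y_tr), pruned.score(X_te, y_te))
```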

The feature importances show that Contract and MonthlyCharges are good predictors of the target class.

Train data accuracy: 77%

Classification report summary:

Precision and recall for the 'Not Churn' class are very good: 80% and 73% respectively.

Precision and recall for the 'Churn' class are also good: 75% and 81% respectively.

Confusion matrix summary:

True Positives (TP): churn cases correctly predicted: 3363

True Negatives (TN): non-churn cases correctly predicted: 3.36

False Positives (FP): non-churn cases incorrectly predicted as churn (Type I error): 1103

False Negatives (FN): churn cases incorrectly predicted as non-churn (Type II error): 776

Now let's check the testing data.

Testing data accuracy: 72%, due to high variance in the dataset and class imbalance.

Classification report summary:

Precision and recall for the 'Not Churn' class are very good: 88% and 72% respectively.

Precision and recall for the 'Churn' class are 49% and 74% respectively.

Confusion matrix summary:

True Positives (TP): churn cases correctly predicted: 275

True Negatives (TN): non-churn cases correctly predicted: 745

False Positives (FP): non-churn cases incorrectly predicted as churn (Type I error): 290

False Negatives (FN): churn cases incorrectly predicted as non-churn (Type II error): 99

Overfitting has been reduced after pruning.

Random Forest Model

The Random Forest model gives a slightly better accuracy of 75% on the testing data.

Precision for class 1 has improved to 52%, while recall is 72%.

There is slight overfitting in the RF model because the trees are overgrown.

Bagging

The bagging model is slightly overfitted on the training data.

The bagging model gives 75% accuracy on the testing data.

Precision for class 1 is 53%, while recall is 70%.

The bagging model also faces a slight overfitting issue due to the class imbalance.

Discriminant Analysis

The LDA model gives slightly lower accuracy than the models above, i.e. 73% on the testing data.

Precision for class 1 is 49%, while recall is 76%.

There is no overfitting in the LDA model, but its precision is poor.

Ada Boost Classifier

The AdaBoost model gives a slightly better accuracy of 74% on the testing data.

Precision for class 1 is 51%, while recall is 78%.

Slightly improved compared to the others, but precision is still low.

Gradient Boosting Classifier

The Gradient Boosting model gives 76% accuracy on the testing data.

Precision for class 1 has improved to 53%, while recall is 68%.

So far this model gives the best precision and recall for class 1 compared to the models above, and has also improved on false positives.

Naive Bayes Model

The Naive Bayes model gives a poor accuracy of 68% on the testing data.

Precision for class 1 is 44%, while recall is 80%.

KNN Model

The KNN model gives an accuracy of 71% on the testing data.

Precision for class 1 is 47%, while recall is 72%.

Logistic Model

The Logistic Regression model gives an accuracy of 74% on the testing data.

Precision for class 1 is 50%, while recall is 74%, which needs to be improved.

Applying GridSearchCV on Gradient Boosting Classifier
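A minimal sketch of the grid search, on synthetic data (the parameter grid below is an assumption; the report does not list the exact values searched):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=500, n_features=8, random_state=42)

# Hypothetical grid of hyperparameter combinations
param_grid = {"n_estimators": [50, 100],
              "max_depth": [2, 3],
              "learning_rate": [0.05, 0.1]}

# Exhaustively evaluate every combination with 3-fold cross-validation
grid = GridSearchCV(GradientBoostingClassifier(random_state=42),
                    param_grid, cv=3, scoring="accuracy", n_jobs=-1)
grid.fit(X, y)
print(grid.best_params_, round(grid.best_score_, 3))
```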

Testing data accuracy with grid search: 74%, due to high variance in the dataset and class imbalance.

Classification report summary:

Precision and recall for the 'Not Churn' class are very good: 87% and 77% respectively.

Precision and recall for the 'Churn' class are 51% and 67% respectively.

Confusion matrix summary:

True Positives (TP): churn cases correctly predicted: 250

True Negatives (TN): non-churn cases correctly predicted: 799

False Positives (FP): non-churn cases incorrectly predicted as churn (Type I error): 236

False Negatives (FN): churn cases incorrectly predicted as non-churn (Type II error): 124

Support Vector Machine

The SVM model achieves 83% accuracy on the training data and 74% on the testing data, while precision and recall for class 1 are 50% and 74% respectively.

Comparison of Different Models

Using Cross Validation Technique
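Cross-validation can be sketched with `cross_val_score`, on synthetic data (the dataset and fold count are illustrative): averaging the fold scores gives a more stable accuracy estimate than a single split.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=8, random_state=42)

# 5-fold cross-validated accuracy of the Gradient Boosting model
scores = cross_val_score(GradientBoostingClassifier(random_state=42),
                         X, y, cv=5, scoring="accuracy")
print(scores.mean().round(3), scores.std().round(3))
```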

Conclusion:

1. After applying SMOTE to the imbalanced dataset and comparing all the models above, I found that random search on the Gradient Boosting model gives good training accuracy, averaging 78% after cross-validation, with precision and recall for class 1 of 53% and 68% respectively.

2. Gradient Boosting has also done well at improving precision on the testing data, i.e. 53%.

3. Model selection: I will be picking the tuned Gradient Boosting model for future use.

Pickle the selected model for future use

I have pickled the model at the above-mentioned location for future use.
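The pickling step can be sketched as a dump/load round trip (the file name `churn_model.pkl` and the synthetic training data are illustrative):

```python
import pickle

from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=200, n_features=5, random_state=42)
model = GradientBoostingClassifier(random_state=42).fit(X, y)

# Serialise the trained model to disk
with open("churn_model.pkl", "wb") as f:
    pickle.dump(model, f)

# Later: load it back and reuse without retraining
with open("churn_model.pkl", "rb") as f:
    restored = pickle.load(f)
print((restored.predict(X) == model.predict(X)).all())  # True
```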

6. GUI development

• Design a clickable GUI desk application or web service application.

• This GUI should allow the user to input all future values and on a click use these values on the trained model above to predict.

• It should display the prediction.

A clickable GUI desktop application was developed using the tkinter library; it takes the future input values from the user and displays the model's prediction on a click.
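A minimal sketch of such a tkinter app. The model path, feature names, and widget layout are assumptions for illustration; the launch call is commented out because it needs a display and a pickled model on disk.

```python
import pickle
import tkinter as tk

MODEL_PATH = "churn_model.pkl"  # illustrative path

def predict_from_entries(model, values):
    """Convert the raw entry-box strings into floats and return the
    model's churn prediction for that single customer row."""
    row = [[float(v) for v in values]]
    return model.predict(row)[0]

def launch_gui(model, feature_names):
    """Build one labelled entry box per feature, plus a Predict button
    that shows the model's output in a label."""
    root = tk.Tk()
    root.title("Churn Predictor")
    entries = []
    for i, name in enumerate(feature_names):
        tk.Label(root, text=name).grid(row=i, column=0)
        entry = tk.Entry(root)
        entry.grid(row=i, column=1)
        entries.append(entry)
    result = tk.Label(root, text="")
    result.grid(row=len(feature_names), columnspan=2)

    def on_click():
        pred = predict_from_entries(model, [e.get() for e in entries])
        result.config(text="Prediction: " + ("Churn" if pred else "No churn"))

    tk.Button(root, text="Predict", command=on_click).grid(
        row=len(feature_names) + 1, columnspan=2)
    root.mainloop()

# To run the app (assuming a pickled model exists at MODEL_PATH):
# with open(MODEL_PATH, "rb") as f:
#     launch_gui(pickle.load(f), ["tenure", "MonthlyCharges", "TotalCharges"])
```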

7. Conclusion and improvisation:

• Write your conclusion on the results.

• Detailed suggestions or improvements or on quality, quantity, variety, velocity, veracity etc. on the data points collected by the telecom operator to perform a better data analysis in future.

Final Conclusion

The following can be interpreted from the data collected by the telecom company, which wants to use historical customer data to predict behaviour and retain customers:

  1. Most features of the dataset are categorical in nature, except four. There is high variance in the dataset, and the classes of the target variable were imbalanced, which led the models to overfit during testing.
  2. During EDA and visualization it was found that most categorical features contribute well to predicting the target class, except a few.
  3. Models built using various machine learning algorithms gave very good scores on the training data but poor performance on the testing data due to the highly imbalanced classes; because of this, precision and recall were slightly lower on the testing data.
  4. SMOTE was used to balance the classes, but due to the small number of observations for the Churn class, model accuracy was slightly lower on the test data.
  5. The Gradient Boosting Classifier gives good precision and recall scores for the Churn class and can be used for deployment.